CLGVSM: Adapting Generalized Vector Space Model to Cross-lingual Document Clustering

نویسندگان

  • Guoyu Tang
  • Yunqing Xia
  • Min Zhang
  • Haizhou Li
  • Fang Zheng
چکیده

Cross-lingual document clustering (CLDC) is the task to automatically organize a large collection of cross-lingual documents into groups considering content or topic. Different from the traditional hard matching strategy, this paper extends traditional generalized vector space model (GVSM) to handle cross-lingual cases, referred to as CLGVSM, by incorporating cross-lingual word similarity measures. With this model, we further compare different word similarity measures in cross-lingual document clustering. To select cross-lingual features effectively, we also propose a softmatching based feature selection method in CLGVSM. Experimental results on benchmarking data set show that (1) the proposed CLGVSM is very effective for cross-document clustering, outperforming the two strong baselines vector space model (VSM) and latent semantic analysis (LSA) significantly; and (2) the new feature selection method can further improve CLGVSM.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Document Representation with Statistical Word Senses in Cross-Lingual Document Clustering

Cross-lingual document clustering is the task of automatically organizing a large collection of multi-lingual documents into a few clusters, depending on their content or topic. It is well known that language barrier and translation ambiguity are two challenging issues for cross-lingual document representation. To this end, we propose to represent cross-lingual documents through statistical wor...

متن کامل

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...

متن کامل

Cross-Lingual Document Clustering

The ever-increasing numbers of Web-accessible documents are available in languages other than English. The management of these heterogeneous document collections has posed a challenge. This paper proposes a novel model, called a domain alignment translation model, to conduct cross-lingual document clustering. While most existing crosslingual document clustering methods make use of an expensive ...

متن کامل

Multilingual document clusters discovery

Cross Language Information Retrieval community has brought up search engines over multilingual corpora, and multilingual text categorization systems. In this paper, we focus on the multilingual clusters discovery problem, which aim is to extract topic-related multilingual document clusters from a multilingual document collection in an unsupervised way. Our approach is based on a linguistic anal...

متن کامل

Probabilistic topic modeling in multilingual settings: An overview of its methodology and applications

Probabilistic topic models are unsupervised generative models which model document content as a two-step generation process, that is, documents are observed as mixtures of latent concepts or topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested into transferring the probabilistic topic modeling concept from monolingua...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011